INTRODUCTION

This is the Introduction to Open Data Science (IODS) course.

“Open data and content can be freely used, modified, and shared by anyone for any purpose.” (opendefinition.org). However, to make use of all the open data available and scrutinize it properly, we need basic skills, tools and lots of motivation. This course will familiarize participants with R, RStudio, R Markdown and GitHub. Simultaneously, everyone will learn some statistics, too. Every week there will be a lecture and scheduled exercises. The interactive platforms DataCamp, MOOC and GitHub make learning flexible, and the enthusiastic supervisory team keeps spirits up even in times of desperation.

I ♥ DATA.

So let’s get started!


My GitHub repository can be found here and course updates can be followed reading my diary posts here.

My research oriented activities are scrutinized here.



1st WEEK: Tools and methods for open and reproducible research

All my course updates can be found in my diary here.

I am a self-taught R user and really enjoy learning new things with it. R Markdown and GitHub are completely new to me. The former seemed logical, but with GitHub it was not love at first sight. Some extra (and concrete) headache was (is, and will potentially be) created by my outdated computer equipment.

I decided to start with DataCamp. It was fun and encouraging, just like the Swirl package. I have my own mysterious ways of figuring things out, but such approaches give me a more structured way of learning and memorizing things. Cool!

Finally I got everything running and could upload my files to my GitHub repository. However, I got a little worried: perhaps I am not skilled enough for this. But, as they say at Cornell:


2nd WEEK: Regression and model validation

Introduction

This week’s analysis exercise focuses on linear regression analysis: carrying it out, interpreting the results and validating the generated model. Based on a study on learning approaches and students’ achievements in an introductory statistics course in Finland, an original dataset learning2014 with 183 responses and 60 variables was provided. The original data can be found here and its metadata here. Briefly, the dataset describes students’ Likert-scaled (1-5) ratings of deep, surface and strategic learning styles, attitudes towards statistics, exam points and gender. More specifically, the deep approach refers to an intention to maximize understanding, the surface approach to memorizing without understanding, and the strategic approach to the ways students organize their studying.

The original dataset needed to be edited for the exercise 2 analysis according to this R script. Firstly, the variables related to deep, surface and strategic learning were combined and averaged, and the variable attitude was rescaled. Thereafter, observations with zero exam points were excluded. The final dataset included seven variables: gender, age, attitude, deep, stra, surf and points from 166 respondents.
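In outline, the wrangling steps described above might look as follows. This is only a sketch: the object lrn14, the item-name vector and the rescaling constant are assumptions of mine, and the linked R script is authoritative.

```r
# Hypothetical sketch of the wrangling steps; column names are assumptions
library(dplyr)

deep_items <- c("D03", "D06", "D07")          # placeholder item names, not the real list
lrn14$deep <- rowMeans(lrn14[, deep_items])   # combine and average the deep-learning items
lrn14$attitude <- lrn14$Attitude / 10         # rescale the attitude sum back to the 1-5 scale
learning2014 <- filter(lrn14, points > 0)     # exclude observations with zero exam points
```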

Basic characteristics of the dataset

The data is read in and the structure of the dataset is checked.

# Read in the dataset
learning2014 <- read.csv(file = "learning2014.csv", header = TRUE)
# Inspect the structure
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014)
## [1] 166   7

Summaries

Descriptive statistics are shown both as an overall summary and as a summary stratified by gender to investigate whether there are clear differences between males and females. The dataset is further explored using pairs for basic scatter plots and the ggpairs command (male = blue, female = red), which produces scatter plots, density plots and histograms.

Basic summaries of the variables

#Basic summaries
summary(learning2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

Grouped summaries by gender. The CreateTableOne command from the tableone package also performs the group comparisons.

#Grouped summaries
CreateTableOne(vars=c("age", "attitude", "deep", "stra", "surf", "points"), strata=c("gender"),data=learning2014 )
##                       Stratified by gender
##                        F            M            p      test
##   n                      110           56                   
##   age (mean (sd))      24.85 (7.36) 26.80 (8.43)  0.127     
##   attitude (mean (sd))  2.99 (0.73)  3.44 (0.65) <0.001     
##   deep (mean (sd))      3.66 (0.53)  3.72 (0.59)  0.457     
##   stra (mean (sd))      3.20 (0.75)  2.96 (0.80)  0.061     
##   surf (mean (sd))      2.83 (0.46)  2.70 (0.64)  0.148     
##   points (mean (sd))   22.33 (5.83) 23.48 (6.01)  0.234

Graphical overviews

Preliminary scatter plots

#Scatter plots
pairs(learning2014[-1], col = learning2014$gender)

A more advanced plot matrix:

#Matrix of plots
p <- ggpairs(learning2014, mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)), title="Figure 2. Graphical overview of the learning2014 data")
p

The output shows that, apart from age, all of the variables seem more or less normally distributed. The distributions are similar for both genders except for attitude: females have lower average scores (mean (sd)), 2.99 (0.73) versus 3.44 (0.65).

The cross-correlation table shows that the strongest correlation is between exam points and attitude. Moreover, the surface and strategic learning styles correlate slightly with exam points; the other correlations are not noteworthy. Splitting by gender reveals some potentially interesting differences in the distributions, but to keep the scope within the exercise requirements they are not investigated further.

Fitting regression model

A multiple linear regression model is fitted with three explanatory variables (attitude, strategic learning and surface learning), selected based on the preliminary inspection of the dataset, to investigate their relationship with the outcome variable, final exam points.

#First model with three explanatory variables
mod <- lm(points ~ attitude + stra + surf, data = learning2014)
summary(mod)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The summary of the fitted model shows that attitude has a positive impact on the final exam points. The estimates for strategic and surface learning prove non-significant.

Two other models are fitted by first excluding surf and then stra. As the variables remain non-significant at the .05 level (data not shown), a final model is fitted including only one explanatory variable: attitude.

#Final model with only one explanatory variable
mod2 <- lm(points ~ attitude, data = learning2014)
summary(mod2)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Thus, the regression model \(y=3.5255x_{1}+11.6372\), \(R^2=0.1906\), where \(y=\) points and \(x_{1}=\) attitude, is chosen. According to the model, a one-point increase in attitude increases the exam points by 3.5 on average. The final model has a slightly smaller R-squared (0.1906) than the first one (0.2074), indicating that the single predictor attitude explains approximately 19% of the variation in students’ exam points.
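As a quick illustration of the chosen equation, the fitted model can be used to predict the exam points of, say, a student with an attitude score of 4:

```r
# Predicted exam points for attitude = 4, using the final model mod2
predict(mod2, newdata = data.frame(attitude = 4))
# 3.5255 * 4 + 11.6372 = 25.74 points
```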

Diagnostic plots

The variable estimates may be biased if the model does not actually fit the data. Thus, the model fit needs to be explored, which is done here using diagnostic plots: Residuals vs Fitted values, a normal Q-Q plot, and Residuals vs Leverage.

#Diagnostic plots for the final model
par(mfrow=c(2,2))
plot(mod2,which=c(1,2,5))

Firstly, the model assumes that the size of the errors does not depend on the explanatory variables. This constant variance of errors assumption seems to hold well, as the residuals plotted against the fitted values show no clear pattern. Secondly, the Q-Q plot implies that the residuals deviate only minimally from the theoretical normal distribution. Thirdly, the standardized residuals plotted against leverage reveal no single observation with an unusually high impact on the model.

Thus, all of these assumptions appear valid.


3rd WEEK: Logistic regression

Introduction

Data including grades and demographic, social and school-related features were collected in two Portuguese schools using school reports and questionnaires, and stored as two separate datasets on performance in two distinct subjects, Mathematics and Portuguese.

The original data for this exercise are freely available as a zip file. Additional metadata describes the basic characteristics of the datasets. For the purpose of this particular exercise they needed to be joined and edited according to this R script. The variables not used for joining the two datasets were combined by averaging. A new variable alc_use was created as the average of weekday and weekend alcohol consumption (Dalc and Walc), and a logical variable high_use was then defined as TRUE when alc_use exceeds 2, the chosen threshold for higher-than-average consumption.
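In outline, the join and the derived variables might be created as follows. This is a sketch only: the data frames math and por and the join columns are assumptions of mine, and the linked R script is authoritative.

```r
# Hypothetical sketch of the join and the derived variables
library(dplyr)

join_cols <- c("school", "sex", "age")  # placeholder; the real script uses more columns
alc <- inner_join(math, por, by = join_cols, suffix = c(".math", ".por"))
alc <- mutate(alc,
              alc_use  = (Dalc + Walc) / 2,  # average of weekday and weekend use
              high_use = alc_use > 2)        # TRUE when above the threshold of 2
```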

Preliminary information and descriptive statistics

Basic characteristics of the dataset

The dataset is read in and its structure is checked.

# Inspect the structure
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school     <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP...
## $ sex        <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address    <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
## $ famsize    <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, ...
## $ Pstatus    <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob       <fctr> at_home, at_home, at_home, health, other, services...
## $ Fjob       <fctr> teacher, other, other, services, other, other, oth...
## $ reason     <fctr> course, course, other, home, home, reputation, hom...
## $ nursery    <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet   <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
## $ guardian   <fctr> mother, father, mother, mother, father, mother, mo...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup  <fctr> yes, no, yes, no, no, no, no, yes, no, no, no, no,...
## $ famsup     <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes...
## $ paid       <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ higher     <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic   <fctr> no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...

The final dataset includes 382 respondents and 35 variables, both integer and factor. The names of the variables are listed below (explanations can be found here):

#Names of the variables
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

Variable selection

The purpose is to investigate the relationship between high alcohol consumption and four chosen independent variables, selected based on their assumed impact and personal interest:

Variable   Explanation                       Classification
sex        gender                            binary factor: F (female) / M (male)
absences   number of school absences         integer: 0-93
famrel     quality of family relationships   integer: 1-5 (very bad to excellent)
health     current health status             integer: 1-5 (very bad to very good)

Hypotheses

Briefly, I assume that men drink more, that more school absences are associated with drinking habits, that students with worse family relationships drink more, and that better health is associated with less alcohol consumption. The specific hypotheses are:

1. Males are more likely than females to be high users of alcohol (more than 2 on the consumption scale).
2. Students with more school absences drink more than those attending school more regularly.
3. Students with very good family relationships consume less alcohol than students with bad family relationships.
4. Very good health status is protective against high alcohol consumption compared to very bad health status.

Graphical overviews

Let’s take a first look graphically: distributions and correlations, as well as bar and box plots grouped by high alcohol use, to better understand the relationship between the chosen variables and the outcome. To place multiple figures on the same plot I use the great multiplot script.

#Matrix of plots
ggpairs(alc_study[-1], mapping = aes(col = sex), lower = list(combo = wrap("facethist", bins = 20)), title="Graphical overview of the 4 variables")

p1 <- ggplot(alc_study, aes(sex)) + geom_bar(aes(fill = high_use), position = "dodge", stat="count") + xlab('Gender')+ggtitle("Gender") +scale_fill_manual(values=c("lightblue","red"))+
  theme(plot.title = element_text(size=9))



p2 <- ggplot(alc_study, aes(x=high_use,y= absences,fill=high_use)) + geom_boxplot(outlier.colour="black", outlier.shape=16,
             outlier.size=2, notch=FALSE) +
stat_summary(fun.y=mean, geom="point", shape=23, size=4)+ggtitle("Absences (means included)")+
scale_fill_manual(values=c("lightblue","red"))+
theme(plot.title = element_text(size=9))


p3 <- ggplot(alc_study, aes(x=high_use,y= famrel,fill=high_use)) + geom_boxplot(outlier.colour="black", outlier.shape=16,
             outlier.size=2, notch=FALSE) +
stat_summary(fun.y=mean, geom="point", shape=23, size=4)+ggtitle("Family relationships (means included)")+
scale_fill_manual(values=c("lightblue","red"))+
theme(plot.title = element_text(size=9))




p4 <- ggplot(alc_study, aes(x=high_use,y= health,fill=high_use)) + geom_boxplot(outlier.colour="black", outlier.shape=16,
             outlier.size=2, notch=FALSE) +
stat_summary(fun.y=mean, geom="point", shape=23, size=4)+ggtitle("Health status (means included)")+
scale_fill_manual(values=c("lightblue","red"))+
  theme(plot.title = element_text(size=9))



multiplot(p1, p2, p3, p4, cols = 2)

Grouped summary statistics and preliminary comments

In addition to the visualization, descriptive statistics are shown numerically as a summary grouped by high_use to get a preliminary, quantitative idea of the dataset. The CreateTableOne command from the tableone package also performs the group comparisons.

#Summary

CreateTableOne(vars=c("sex","absences","famrel","health"), strata=c("high_use"),factorVars=c("sex"),data=alc_study )
##                       Stratified by high_use
##                        FALSE        TRUE         p      test
##   n                     268          114                    
##   sex = M (%)           112 (41.8)    72 (63.2)  <0.001     
##   absences (mean (sd)) 3.71 (4.45)  6.37 (6.99)  <0.001     
##   famrel (mean (sd))   4.00 (0.89)  3.78 (0.93)   0.027     
##   health (mean (sd))   3.52 (1.40)  3.70 (1.40)   0.242

There are proportionally more male high users than female ones. Family relationships are on average better for those not using a lot of alcohol (4.00 vs 3.78), although there is no difference in the medians (4.00 vs 4.00). Both the mean (6.37) and the median (4.00) number of absences are higher in the high-use group than among students consuming less than average (3.71, 3.00), and there are quite a few outliers in both groups. Surprisingly, the high-use group reports slightly better current health (mean = 3.70, median = 4.00) than the non-high-use group (3.52, 4.00), although the differences are small.

Thus, based on both the graphical and numerical overviews, it seems that all of the variables except health are at least to some extent associated with the level of alcohol consumption, as I hypothesized. Inspection also showed no relevant correlations between the chosen explanatory variables. To explore the associations further, a logistic regression analysis is carried out.

Fitting the model

A logistic regression model with the four explanatory variables selected above is fitted to identify those related to higher-than-average student alcohol consumption. Logistic regression is a basic approach for binary outcomes, and its results provide probability estimates of the event occurring (P(Y=1)).

#First model with four explanatory variables
m1 <- glm(high_use ~ sex + absences + famrel + health, data = alc_study, family = "binomial")
summary(m1)
## 
## Call:
## glm(formula = high_use ~ sex + absences + famrel + health, family = "binomial", 
##     data = alc_study)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2557  -0.8609  -0.6151   1.0698   2.1157  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.94806    0.58165  -1.630   0.1031    
## sexM         1.00440    0.24498   4.100 4.13e-05 ***
## absences     0.09375    0.02264   4.141 3.45e-05 ***
## famrel      -0.30625    0.12864  -2.381   0.0173 *  
## health       0.08331    0.08779   0.949   0.3426    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 423.95  on 377  degrees of freedom
## AIC: 433.95
## 
## Number of Fisher Scoring iterations: 4

The summary of the model confirms what could already be suspected based on the plots and the grouped numerical summary. The variables sex, absences and famrel are significant at the 5% level, while the effect of health is not even borderline significant (p=0.3426). The estimates also show that family relationships are negatively associated with high_use, while school absences are positively associated. Furthermore, being male increases the risk of being a high consumer of alcohol.

Odds, odds ratio and confidence intervals

To better understand the odds ratio, which logistic regression uses to quantify the relationship between an explanatory factor and the target variable, the term odds should first be clarified: the odds of an event are its probability divided by the probability of its complement, p/(1-p). Logically, the ratio of two odds is called the odds ratio. An odds ratio above 1 means the factor is positively associated with the event happening, while a ratio below 1 indicates a negative association, or, if the event is unwanted (e.g. death), a protective effect. To further complicate the matter, the logistic regression model is linear in the log of the odds, not the odds themselves. Thus, in the following table, the odds ratios are first obtained by exponentiating the summary estimates, and then the profile likelihood-based confidence intervals are calculated using the confint command from the MASS package:
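The objects OR and CI used below are not shown being created; presumably they were computed along these lines (a sketch based on the description above):

```r
# Odds ratios and profile-likelihood confidence intervals for model m1
OR <- exp(coef(m1))     # exponentiate the log-odds estimates
CI <- exp(confint(m1))  # confint() on a glm uses MASS's profile-likelihood method
```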

cbind(OR, CI)
##                    OR     2.5 %    97.5 %
## (Intercept) 0.3874923 0.1218232 1.2018295
## sexM        2.7302676 1.6999004 4.4499308
## absences    1.0982852 1.0527088 1.1506747
## famrel      0.7362043 0.5708862 0.9470728
## health      1.0868775 0.9165274 1.2940911

Apart from the variable health, the confidence intervals do not include one, which, as suspected, indicates statistically significant factors. If a confidence interval included 1, the factor could have either a positive or a negative effect, or no effect at all, on the risk of the outcome.

Second model

Next, the non-significant variable health is excluded and the model is fitted again.

#Second model with three explanatory variables
m2 <- glm(high_use ~ sex + absences + famrel, data = alc_study, family = "binomial")
summary(m2)
## 
## Call:
## glm(formula = high_use ~ sex + absences + famrel, family = "binomial", 
##     data = alc_study)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1998  -0.8427  -0.6062   1.0845   2.1356  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.71798    0.52862  -1.358   0.1744    
## sexM         1.02877    0.24341   4.227 2.37e-05 ***
## absences     0.09351    0.02266   4.127 3.68e-05 ***
## famrel      -0.29090    0.12744  -2.283   0.0225 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 424.86  on 378  degrees of freedom
## AIC: 432.86
## 
## Number of Fisher Scoring iterations: 4

Now all of the variables in the model are significant at the 5 % level.

Oddsratios_Confint
##                           2.5 %    97.5 %
## (Intercept) 0.4877368 0.1702903 1.3635949
## sexM        2.7976292 1.7476499 4.5467291
## absences    1.0980228 1.0524564 1.1504519
## famrel      0.7475910 0.5811204 0.9595347

The significant odds ratios reveal that males have 2.8 (95% CI 1.7-4.5) times higher odds of being high consumers of alcohol than females. (On the log-odds scale, the linear predictor for a male adds the sexM estimate to the intercept.) With increasing absences a person is more likely to be a high alcohol consumer, and, conversely, the better the family relationships, the less likely the person is to be a high consumer of alcohol.

More specifically, to interpret both the summary and the table of odds ratios and confidence intervals, consider the coefficient for absences, 0.09351. It can be interpreted as the expected change in the log-odds for a one-unit increase in the number of absences. Exponentiating it gives the odds ratio 1.0980228, meaning we expect about a 10% increase in the odds of being a high alcohol user for each additional absence.
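The same figure can be read straight from the fitted model object:

```r
# Odds ratio for a one-unit increase in absences, from the second model m2
exp(coef(m2)["absences"])  # exp(0.09351) = 1.098, i.e. ~10% higher odds per extra absence
```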

Predictive power

The second model is used to explore its predictive power:

# Predicted probabilities from the second model
probabilities <- predict(m2, type = "response")
predictions <- probabilities > 0.5
# Add the probabilities
alc_study <- mutate(alc_study, probability = probabilities)
# Calculate a logical high-use prediction based on the probabilities
alc_study <- mutate(alc_study, prediction = probability > 0.5)
table(high_use = alc_study$high_use, prediction = predictions) %>% addmargins %>% round(digits = 2)
##         prediction
## high_use FALSE TRUE Sum
##    FALSE   256   12 268
##    TRUE     85   29 114
##    Sum     341   41 382
table(high_use=alc_study$high_use,prediction=predictions) %>% prop.table %>% addmargins() %>% round(digits=2)
##         prediction
## high_use FALSE TRUE  Sum
##    FALSE  0.67 0.03 0.70
##    TRUE   0.22 0.08 0.30
##    Sum    0.89 0.11 1.00

There are 256 true negatives, 29 true positives, 85 false negatives and 12 false positives. In other words, the model predicts high consumption less often than it actually occurs, as can be demonstrated graphically:

hu <- as.data.frame(prop.table(table(alc_study$high_use)))
pred <- as.data.frame(prop.table(table(alc_study$prediction)))
pp1 <- ggplot(hu, aes(Var1, Freq)) + geom_col(aes(fill = Var1)) + scale_y_continuous(limits = 0:1) + ylab('frequency') + xlab('OBSERVED high use') + theme(legend.position = 'none')+
scale_fill_manual(values=c("lightblue","green"))+
 theme(
plot.title = element_text(color="green", size=14, face="bold.italic"),
axis.title.x = element_text(color="green", size=9, face="bold"),
axis.title.y = element_text(color="#993333", size=14, face="bold")
)

pp2 <- ggplot(pred, aes(Var1, Freq)) + geom_col(aes(fill = Var1)) + scale_y_continuous(limits = 0:1) + ylab('frequency') + xlab('PREDICTED high use') + theme(legend.position = 'none')+
scale_fill_manual(values=c("lightblue","orange"))+
 theme(
plot.title = element_text(color="red", size=14, face="bold.italic"),
axis.title.x = element_text(color="orange", size=8, face="bold"),
axis.title.y = element_text(color="#993333", size=14, face="bold")
)

multiplot(pp1, pp2, cols = 2)

Since we know how to make predictions with the model, we can also compute the average proportion of incorrect predictions. For our model it is 25.4%, suggesting the model lacks accuracy. Logistic regression aims to minimize incorrectly classified observations. Still, the model performs better than pure guessing: a coin toss predicts either outcome with 50% probability.

#loss_func from DataCamp exercises
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = alc_study$high_use, prob = alc_study$probability) # Training error of the three-variable model
## [1] 0.2539267

Cross validation

Cross-validation tests the model on unseen data, i.e. data not used for fitting the model. The lower the resulting prediction error, the more accurate the model. Cross-validation can also be used to compare models.

#Errors computed and stored
cv <- cv.glm(data = alc_study, cost = loss_func, glmfit = m2, K = 10)
testerror<-cv$delta[1]

A ten-fold cross-validation shows that, on average, approximately 26% of the observations are misclassified under our model with three explanatory variables: sex, famrel and absences. The average number of wrong predictions in the cross-validation is thus about the same as in the DataCamp exercise (26%).

However, if the goal of a study like this were to identify, and then control, risk factors for high alcohol use, we would not succeed very well; the model seems to be missing relevant risk factors. Thus, I try to improve it by adding another variable, age.

mbetter <- glm(high_use ~ famrel + absences + sex + age, family = binomial, data = alc)
#Errors computed and stored
cvbetter <- cv.glm(data = alc, cost = loss_func, glmfit = mbetter, K = 10)
cvbetter$delta[1] # Print the average number of wrong predictions.
## [1] 0.2486911

Adding age to the model hardly improves performance, yielding only a slightly lower test error.

Further cross validation and model comparisons

Model with 12 variables

A dataset with 11 integer variables and one factor variable (gender) is used as a whole to try to improve predictive accuracy.

alcs <- alc[c("age", "Medu", "Fedu", "failures", "G3", "sex", "absences",
              "famrel", "health", "goout", "studytime", "freetime", "high_use")]
full <- glm(high_use ~ ., family = binomial, data = alcs)
#Errors computed and stored
cvfull <- cv.glm(data = alcs, cost = loss_func, glmfit = full, K = 10)
cvfull$delta[1] # Print the average number of wrong predictions.
## [1] 0.2277487

The full model based on this data set still has a reasonably high prediction error.

Model reduction

I choose not to compare model performance solely based on the number of variables, but rather on the selection method. For purposeful selection and model comparison, stepAIC() and then bestglm() are used. More information about these selection methods can be found in an article here.

library(leaps)
library(MASS)
library(bestglm)
scope <- stepAIC(full, scope = list(lower = ~ sex + absences + famrel, upper = full), trace = FALSE)
scope$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## high_use ~ age + Medu + Fedu + failures + G3 + sex + absences + 
##     famrel + health + goout + studytime + freetime
## 
## Final Model:
## high_use ~ sex + absences + famrel + goout + studytime
## 
## 
##         Step Df   Deviance Resid. Df Resid. Dev      AIC
## 1                                369   369.5715 395.5715
## 2       - G3  1 0.02581712       370   369.5973 393.5973
## 3 - failures  1 0.32319318       371   369.9205 391.9205
## 4 - freetime  1 0.60688507       372   370.5274 390.5274
## 5     - Fedu  1 0.55002583       373   371.0774 389.0774
## 6     - Medu  1 0.16331742       374   371.2407 387.2407
## 7      - age  1 1.06533948       375   372.3061 386.3061
## 8   - health  1 1.67027019       376   373.9764 385.9764
bestglm(alcs,IC="BIC",family=binomial)
## Morgan-Tatar search since family is non-gaussian.
## BIC
## BICq equivalent for q in (0.268888789905018, 0.514103143659991)
## Best Model:
##                Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -2.76826342 0.66169773 -4.183577 2.869578e-05
## sexM         1.01234164 0.25894570  3.909475 9.249710e-05
## absences     0.08167686 0.02199598  3.713263 2.046039e-04
## famrel      -0.39378406 0.14034903 -2.805748 5.019988e-03
## goout        0.76761101 0.12316480  6.232389 4.593740e-10

Based on the outputs of the two approaches, the best models are fitted. The former (stepAIC) includes gender, absences, family relationships, going out with friends and study time; the latter (bestglm) includes gender, absences, family relationships and going out with friends.

hu <- as.data.frame(prop.table(table(alcs$high_use)))
predfull <- as.data.frame(prop.table(table(alcsfull$prediction)))
predAIC <- as.data.frame(prop.table(table(alcsAIC$prediction)))
predbest <- as.data.frame(prop.table(table(alcsbest$prediction)))

ppobs <- ggplot(hu, aes(Var1, Freq)) +
  geom_col(aes(fill = Var1)) +
  scale_y_continuous(limits = 0:1) +
  ylab("frequency") + xlab("Observed") +
  theme(legend.position = "none") +
  scale_fill_manual(values = c("lightblue", "green")) +
  theme(plot.title = element_text(color = "green", size = 14, face = "bold.italic"),
        axis.title.x = element_text(color = "green", size = 8, face = "bold"),
        axis.title.y = element_text(color = "#993333", size = 14, face = "bold"))

ppfull <- ggplot(predfull, aes(Var1, Freq)) +
  geom_col(aes(fill = Var1)) +
  scale_y_continuous(limits = 0:1) +
  ylab("frequency") + xlab("Full model") +
  theme(legend.position = "none") +
  scale_fill_manual(values = c("lightblue", "orange")) +
  theme(plot.title = element_text(color = "red", size = 14, face = "bold.italic"),
        axis.title.x = element_text(color = "orange", size = 8, face = "bold"),
        axis.title.y = element_text(color = "#993333", size = 14, face = "bold"))

ppAIC <- ggplot(predAIC, aes(Var1, Freq)) +
  geom_col(aes(fill = Var1)) +
  scale_y_continuous(limits = 0:1) +
  ylab("frequency") + xlab("BestAIC model") +
  theme(legend.position = "none") +
  scale_fill_manual(values = c("lightblue", "darkgreen")) +
  theme(plot.title = element_text(color = "red", size = 14, face = "bold.italic"),
        axis.title.x = element_text(color = "darkgreen", size = 8, face = "bold"),
        axis.title.y = element_text(color = "#993333", size = 14, face = "bold"))

ppbest <- ggplot(predbest, aes(Var1, Freq)) +
  geom_col(aes(fill = Var1)) +
  scale_y_continuous(limits = 0:1) +
  ylab("frequency") + xlab("Bestglm model") +
  theme(legend.position = "none") +
  scale_fill_manual(values = c("lightblue", "blue")) +
  theme(plot.title = element_text(color = "blue", size = 14, face = "bold.italic"),
        axis.title.x = element_text(color = "blue", size = 8, face = "bold"),
        axis.title.y = element_text(color = "#993333", size = 14, face = "bold"))

multiplot(ppobs, ppfull, ppAIC, ppbest,cols = 4)

Model                          Training error  Test error
FULL model with 12 variables   0.2094241       0.2329843
AIC model with 5 variables     0.2172775       0.2225131
BEST model with 4 variables    0.2041885       0.2198953

Final model with the lowest testing error

The last model uses the best subsets approach, which aims to find the best-fitting model among all possible subset models. It has only four explanatory variables: sex, absences, family relationships and going out with friends. Using cross-validation it yields a test error of just below 22%, i.e. it predicts about 78% of the cases correctly. However, there are still 64 false negatives and 14 false positives.

probabilities <- predict(mbest, type = "response")
predictions <- probabilities > 0.5
# Add the probabilities
alcbest <- mutate(alcs, probability = probabilities)
# Calculate a logical high-use value based on the probabilities
alcbest <- mutate(alcbest, prediction = probability > 0.5)
table(high_use = alcbest$high_use, prediction = predictions) %>% addmargins %>% round(digits = 2)
##         prediction
## high_use FALSE TRUE Sum
##    FALSE   254   14 268
##    TRUE     64   50 114
##    Sum     318   64 382
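To keep the terminology straight, the error counts can be read off the table above: rows are observed classes, columns predictions. A quick base-R check with the same numbers:

```r
# Confusion matrix from the output above: rows = observed, cols = predicted.
cm <- matrix(c(254, 14,
                64, 50), nrow = 2, byrow = TRUE,
             dimnames = list(high_use = c("FALSE", "TRUE"),
                             prediction = c("FALSE", "TRUE")))
cm["FALSE", "TRUE"]      # false positives: 14
cm["TRUE", "FALSE"]      # false negatives: 64
sum(diag(cm)) / sum(cm)  # training accuracy: 304/382, about 0.80
```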

To conclude, the results of the final model are presented as odds ratios with the corresponding confidence intervals:

stargazer(mbest, coef = list(OR.vector), ci = TRUE,
          ci.custom = list(CI.vector), p = list(p.values),
          single.row = TRUE, align = TRUE, type = "html",
          title = "Logistic regression model: Variables related to alcohol consumption",
          column.labels = c("odds ratio (confint)"),
          covariate.labels = c("Gender(M)", "School absences", "Family relationships", "Going out with friends"),
          dep.var.caption = "Model with the lowest testing error",
          dep.var.labels = "Dependent variable: More than average alcohol usage")
Logistic regression model: Variables related to alcohol consumption
Model with the lowest testing error
Dependent variable:More than average alcohol usage
odds ratio(confint)
Gender(M) 2.752*** (1.668, 4.613)
School absences 1.085*** (1.040, 1.135)
Family relationships 0.674*** (0.511, 0.887)
Going out with friends 2.155*** (1.703, 2.764)
Constant 0.063*** (0.016, 0.223)
Observations 382
Log Likelihood -189.904
Akaike Inf. Crit. 389.809
Note: *p<0.1; **p<0.05; ***p<0.01

Ref. Fabio Pagnotta and Hossain Mohammad Amran, Using Data Mining To Predict Secondary School Student Alcohol Consumption (2008), published by the Department of Computer Science, University of Camerino


4th WEEK: Clustering and classification

Cluster analysis is one of the main tasks of exploratory data mining and is thus the topic of this week’s exercise. Clustering techniques identify groups (clusters) of observations such that members within a cluster are similar to each other, while observations in different clusters differ. Defining what “similar” means, however, often requires a lot of contextual knowledge and creativity.
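As a minimal illustration of the idea, K-means applied to two clearly separated point clouds recovers the underlying groups (a toy sketch on simulated data):

```r
# K-means on two well-separated simulated groups.
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),   # group 1 around (0, 0)
           matrix(rnorm(40, mean = 5), ncol = 2))   # group 2 around (5, 5)
km_toy <- kmeans(x, centers = 2)
# Each true group falls entirely into one cluster:
table(cluster = km_toy$cluster, truth = rep(1:2, each = 20))
```

Real data are of course rarely this clean, which is where the contextual judgement comes in.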

Introduction

The Boston data set will be used. It can be loaded from the R package MASS. According to ?Boston, the data frame has 506 rows (observations) and 14 columns (variables). Briefly, the data report several variables potentially explaining housing values around Boston. Our aim is to classify the suburbs in the Boston data set into classes based on their characteristics. The variables are:

  • crim: per capita crime rate by town

  • zn: proportion of residential land zoned for lots over 25,000 sq.ft.

  • indus: proportion of non-retail business acres per town

  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

  • nox: nitrogen oxides concentration (parts per 10 million)

  • rm: average number of rooms per dwelling

  • age: proportion of owner-occupied units built prior to 1940

  • dis: weighted mean of distances to five Boston employment centres

  • rad: index of accessibility to radial highways

  • tax: full-value property-tax rate per $10,000

  • ptratio: pupil-teacher ratio by town

  • black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

  • lstat: lower status of the population (percent)

  • medv: median value of owner-occupied homes in $1000s

Firstly, the data are loaded, glimpsed and thereafter summaries are printed.

# load data
data("Boston")
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...

Summaries and plots

Basic variable characteristics

The descriptive summary includes skewness (a measure of the asymmetry of a distribution) and kurtosis (a measure of the tail-heaviness of a distribution), and already shows the quantile values of the crime rate that will be used for categorisation later.
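Skewness and kurtosis follow directly from their moment definitions. A base-R sketch (tableone delegates these to the e1071 package, whose default formulas may differ slightly in small-sample corrections):

```r
# Moment-based skewness and excess kurtosis.
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
kurt <- function(x) mean((x - mean(x))^4) / sd(x)^4 - 3  # ~0 for normal data
set.seed(7)
skew(rexp(1000))   # clearly positive for a right-skewed distribution
kurt(rnorm(1000))  # close to zero for normal data
```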

tab1 <- CreateTableOne(vars = c("crim", "zn", "indus", "chas", "nox", "rm",
                                "age", "dis", "rad", "tax", "ptratio", "black",
                                "lstat", "medv"),
                       factorVars = c("rad", "chas"), data = Boston)
summary(tab1)
## 
##      ### Summary of continuous variables ###
## 
## strata: Overall
##           n miss p.miss  mean    sd median   p25   p75   min   max skew
## crim    506    0      0   3.6   8.6    0.3 8e-02   3.7 6e-03  89.0  5.2
## zn      506    0      0  11.4  23.3    0.0 0e+00  12.5 0e+00 100.0  2.2
## indus   506    0      0  11.1   6.9    9.7 5e+00  18.1 5e-01  27.7  0.3
## nox     506    0      0   0.6   0.1    0.5 4e-01   0.6 4e-01   0.9  0.7
## rm      506    0      0   6.3   0.7    6.2 6e+00   6.6 4e+00   8.8  0.4
## age     506    0      0  68.6  28.1   77.5 5e+01  94.1 3e+00 100.0 -0.6
## dis     506    0      0   3.8   2.1    3.2 2e+00   5.2 1e+00  12.1  1.0
## tax     506    0      0 408.2 168.5  330.0 3e+02 666.0 2e+02 711.0  0.7
## ptratio 506    0      0  18.5   2.2   19.1 2e+01  20.2 1e+01  22.0 -0.8
## black   506    0      0 356.7  91.3  391.4 4e+02 396.2 3e-01 396.9 -2.9
## lstat   506    0      0  12.7   7.1   11.4 7e+00  17.0 2e+00  38.0  0.9
## medv    506    0      0  22.5   9.2   21.2 2e+01  25.0 5e+00  50.0  1.1
##          kurt
## crim    37.13
## zn       4.03
## indus   -1.23
## nox     -0.06
## rm       1.89
## age     -0.97
## dis      0.49
## tax     -1.14
## ptratio -0.29
## black    7.23
## lstat    0.49
## medv     1.50
## 
## =======================================================================================
## 
##      ### Summary of categorical variables ### 
## 
## strata: Overall
##   var   n miss p.miss level freq percent cum.percent
##  chas 506    0    0.0     0  471    93.1        93.1
##                           1   35     6.9       100.0
##                                                     
##   rad 506    0    0.0     1   20     4.0         4.0
##                           2   24     4.7         8.7
##                           3   38     7.5        16.2
##                           4  110    21.7        37.9
##                           5  115    22.7        60.7
##                           6   26     5.1        65.8
##                           7   17     3.4        69.2
##                           8   24     4.7        73.9
##                          24  132    26.1       100.0
## 

Distribution plots

To get a better idea of the variables and their distributions some plots are generated.

#density plots for numerical variables
Boston %>%
  keep(is.numeric) %>%                     # keep only numeric columns
  gather() %>%                             # convert to key-value pairs
  ggplot(aes(value)) +                     # plot the values
    facet_wrap(~ key, scales = "free") +   # in separate panels
    geom_density()                         # as density

#histograms for integer variables
Boston %>%
  keep(is.integer) %>% 
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram(bins=10)

Given the nature of the variables (percentages, proportions or function-based values) it is understandable that most of them have rather uneven or skewed distributions: e.g. black (scaled proportion of black residents), indus (proportion of non-retail business acres), age (proportion of owner-occupied units built prior to 1940), zn (proportion of land zoned for very large lots) and lstat (lower status of the population, percent). In contrast, the average number of rooms per dwelling (rm) is normally distributed, and the median value of owner-occupied homes (medv) can also be judged to be. The Charles River dummy variable (chas) is binary (1/0), indicating whether the tract bounds the river, and the radial highway accessibility variable (rad) is an interval-scaled index.

Crime rate per capita

#to get a better idea about variable crim it is plotted separately
plot(Boston$crim, col="red",pch=8, main="Crime rate per capita")
text(198,59," Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.01    0.08    0.26    3.61    3.68   88.98" )

ggplot(Boston, aes(x = crim)) +
  stat_density(aes(col="red"),position="identity",geom="line",size=2)+
  ggtitle("Crime rate per capita")+ theme(legend.position="none")

We are especially interested in the variable crim, so it is visualized separately. The crime rate varies a lot between areas, ranging from 0.01 to 88.98. As the plot shows, quite a few high outlier values among the 506 observations pull the mean (3.61) well above the median (0.26). The distribution is strongly skewed to the right.
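The mean being pulled far above the median is the hallmark of right skew; a quick check on crim itself:

```r
# For a right-skewed variable the mean exceeds the median.
data(Boston, package = "MASS")
mean(Boston$crim)    # about 3.61
median(Boston$crim)  # about 0.26
mean(Boston$crim) > median(Boston$crim)  # TRUE
```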

Variable correlations

To explore the relations between the variables of the data set, pairwise scatter plots and a correlation plot are printed.

#scatterplots
pairs(Boston,lower.panel = NULL)

To me these scatter plots are not that informative, though. Thus, I will try another approach, where the correlation chart simultaneously presents the direction (colour) and magnitude (size of the circle) of each correlation as well as its value.

#a more visual correlation matrix
cor_matrix<-cor(Boston) %>% round(2)
corrplot.mixed(cor_matrix,number.cex=0.65,tl.cex=0.6)

And, finally, I create a conservative table of correlations with significance-level notation, with the help of this.

library(xtable)  
corstars <-function(x, method=c("pearson", "spearman"), removeTriangle=c("upper", "lower"),
                     result=c("none", "html", "latex")){
    #Compute correlation matrix
    require(Hmisc)
    x <- as.matrix(x)
    correlation_matrix<-rcorr(x, type=method[1])
    R <- correlation_matrix$r # Matrix of correlation coeficients
    p <- correlation_matrix$P # Matrix of p-value 
    
    ## Define notions for significance levels; spacing is important.
    mystars <- ifelse(p < .0001, "****", ifelse(p < .001, "*** ", ifelse(p < .01, "**  ", ifelse(p < .05, "*   ", "    "))))
    
    ## truncate the correlation matrix to two decimals
    R <- format(round(cbind(rep(-1.11, ncol(x)), R), 2))[,-1]
    
    ## build a new matrix that includes the correlations with their appropriate stars
    Rnew <- matrix(paste(R, mystars, sep=""), ncol=ncol(x))
    diag(Rnew) <- paste(diag(R), " ", sep="")
    rownames(Rnew) <- colnames(x)
    colnames(Rnew) <- paste(colnames(x), "", sep="")
    
    ## remove upper triangle of correlation matrix
    if(removeTriangle[1]=="upper"){
      Rnew <- as.matrix(Rnew)
      Rnew[upper.tri(Rnew, diag = TRUE)] <- ""
      Rnew <- as.data.frame(Rnew)
    }
    
    ## remove lower triangle of correlation matrix
    else if(removeTriangle[1]=="lower"){
      Rnew <- as.matrix(Rnew)
      Rnew[lower.tri(Rnew, diag = TRUE)] <- ""
      Rnew <- as.data.frame(Rnew)
    }
    
    ## remove last column and return the correlation matrix
    Rnew <- cbind(Rnew[1:length(Rnew)-1])
    if (result[1]=="none") return(Rnew)
    else{
      if(result[1]=="html") print(xtable(Rnew), type="html")
      else print(xtable(Rnew), type="latex") 
    }
} 
# Make table with centered columns
corstars(Boston,result="html")
crim zn indus chas nox rm age dis rad tax ptratio black lstat
crim
zn -0.20****
indus 0.41**** -0.53****
chas -0.06 -0.04 0.06
nox 0.42**** -0.52**** 0.76**** 0.09*
rm -0.22**** 0.31**** -0.39**** 0.09* -0.30****
age 0.35**** -0.57**** 0.64**** 0.09 0.73**** -0.24****
dis -0.38**** 0.66**** -0.71**** -0.10* -0.77**** 0.21**** -0.75****
rad 0.63**** -0.31**** 0.60**** -0.01 0.61**** -0.21**** 0.46**** -0.49****
tax 0.58**** -0.31**** 0.72**** -0.04 0.67**** -0.29**** 0.51**** -0.53**** 0.91****
ptratio 0.29**** -0.39**** 0.38**** -0.12** 0.19**** -0.36**** 0.26**** -0.23**** 0.46**** 0.46****
black -0.39**** 0.18**** -0.36**** 0.05 -0.38**** 0.13** -0.27**** 0.29**** -0.44**** -0.44**** -0.18****
lstat 0.46**** -0.41**** 0.60**** -0.05 0.59**** -0.61**** 0.60**** -0.50**** 0.49**** 0.54**** 0.37**** -0.37****
medv -0.39**** 0.36**** -0.48**** 0.18**** -0.43**** 0.70**** -0.38**** 0.25**** -0.38**** -0.47**** -0.51**** 0.33**** -0.74****

 

From the information above it can be seen that there are several relevant correlations between the variables. Strong negative correlations exist between the weighted mean distance to five Boston employment centres (dis) and the proportion of non-retail business acres per town (indus), the nitrogen oxide concentration (nox) and the proportion of older properties (age). Not surprisingly, a strong negative correlation is seen between the lower status of the population (lstat) and the median home value (medv). Strong positive correlations exist especially between the index of accessibility to radial highways (rad) and the full-value property-tax rate (tax) as well as the proportion of non-retail business acres (indus). The proportion of non-retail business acres (indus) is further positively correlated with the nitrogen oxide concentration (nox) and the property-tax rate (tax). Furthermore, one of our main interests, the crime rate, is correlated with many of the variables: negatively with e.g. distance to employment centres (dis) and median home value (medv), positively with e.g. the property-tax rate (tax) and access to radial highways (rad). Thus, an increase in crime rate seems to be associated with an increasing highway accessibility index and property tax.

Scaling the dataset and categorising crime rate

Linear discriminant analysis is a method that generates linear combinations of variables to characterize the classes. To enable the method, the data set needs to be standardized so that every variable has zero mean and unit variance: the column mean is subtracted from each column and the difference is divided by the standard deviation:

\[ scaled(x) = \frac{x - \text{mean}(x)}{\text{sd}(x)} \]

#scale the dataset
boston_scaled <- as.data.frame(scale(Boston))
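A quick sanity check that scale() applies exactly the formula above, using the crim column:

```r
# scale() subtracts the column mean and divides by the column sd.
data(Boston, package = "MASS")
boston_scaled <- as.data.frame(scale(Boston))
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(boston_scaled$crim, manual)  # TRUE
```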

In comparison to the earlier summary, all mean values are now zero and all standard deviations one, and the ranges of the values have shrunk accordingly.

In addition to scaling, a categorical variable based on the scaled crime rate has to be created. Quantiles are used as break points to yield four categories — low, medium low, medium high and high crime rates — and thus four groups with approximately equal numbers of observations.

Next, the data set is randomly split into a train (80%) and a test (20%) set for the analysis. Thus, the train set has 404 and the test set 102 observations.

# create a quantile vector of crim, and use it to categorize crim
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c('low','med_low','med_high','high'))
# replace the original unscaled variable.
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
# explore the categorised variable.
table(boston_scaled$crime)
## 
##      low  med_low med_high     high 
##      127      126      126      127

To compare how the original values differ between the created groups, a summary stratified by the crime rate categories is created. There are many large differences, especially between the high and low crime rate groups.

Boston2<-Boston
Boston2$crime <- boston_scaled$crime
CreateTableOne(vars = c("zn", "indus", "chas", "nox", "rm", "age", "dis",
                        "rad", "tax", "ptratio", "black", "lstat", "medv"),
               strata = c("crime"), test = FALSE, data = Boston2)
##                      Stratified by crime
##                       low            med_low         med_high      
##   n                      127            126             126        
##   zn (mean (sd))       33.85 (34.67)   9.11 (14.54)    2.41 (6.59) 
##   indus (mean (sd))     4.88 (4.28)    9.14 (5.80)    12.41 (6.57) 
##   chas (mean (sd))      0.04 (0.20)    0.06 (0.24)     0.12 (0.33) 
##   nox (mean (sd))       0.45 (0.05)    0.49 (0.06)     0.60 (0.11) 
##   rm (mean (sd))        6.60 (0.56)    6.19 (0.47)     6.36 (0.88) 
##   age (mean (sd))      43.65 (19.59)  59.03 (29.05)   80.18 (21.91)
##   dis (mean (sd))       5.64 (2.08)    4.54 (1.93)     3.00 (1.25) 
##   rad (mean (sd))       3.54 (1.59)    4.78 (1.50)     5.96 (4.28) 
##   tax (mean (sd))     283.83 (63.32) 327.83 (102.02) 356.32 (91.50)
##   ptratio (mean (sd))  17.50 (1.89)   18.32 (1.64)    17.84 (2.86) 
##   black (mean (sd))   391.25 (9.19)  386.14 (30.87)  364.64 (56.91)
##   lstat (mean (sd))     7.15 (2.79)   11.71 (5.53)    12.74 (6.91) 
##   medv (mean (sd))     27.37 (8.00)   22.51 (5.38)    24.10 (10.28)
##                      Stratified by crime
##                       high           
##   n                      127         
##   zn (mean (sd))        0.00 (0.00)  
##   indus (mean (sd))    18.11 (0.13)  
##   chas (mean (sd))      0.06 (0.23)  
##   nox (mean (sd))       0.68 (0.06)  
##   rm (mean (sd))        6.00 (0.69)  
##   age (mean (sd))      91.45 (9.95)  
##   dis (mean (sd))       2.00 (0.55)  
##   rad (mean (sd))      23.85 (1.69)  
##   tax (mean (sd))     663.93 (23.34) 
##   ptratio (mean (sd))  20.16 (0.49)  
##   black (mean (sd))   284.96 (147.79)
##   lstat (mean (sd))    19.00 (6.85)  
##   medv (mean (sd))     16.16 (8.63)

Fitting the model

LDA analysis on the train set

A linear discriminant analysis (LDA) model is fitted to classify the suburbs, using the categorized crime rate as the target variable. The classification is first performed on the training data, and thereafter the classes are predicted on the test data.

set.seed(123)
# number of rows in the Boston dataset 
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)
# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

LDA (bi)plots

Based on the analysis, the results are plotted first plain, and thereafter as an LDA (bi)plot with the help of a specifically generated “arrow” function that adds the arrows. Keep in mind that for plotting, the classes have to be transformed from categorical to numeric.

#helper function for the biplot arrows.
lda.arrows <- function(x, myscale = 2, arrow_heads = 0.2, color = "deeppink", tex = 1, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
#crime classes to numeric for plotting
classes <- as.numeric(train$crime) 
#plotting the lda
p1<-plot(lda.fit, dimen = 2, col = classes, pch = classes)

#(bi)plot
p2<-plot(lda.fit, dimen = 2, col = classes, pch = classes)
#arrows 
lda.arrows(lda.fit) 

print(lda.fit) 
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2500000 0.2475248 0.2549505 0.2475248 
## 
## Group means:
##                  zn      indus        chas        nox          rm
## low       0.9432005 -0.8959713 -0.11640431 -0.8577158  0.44620905
## med_low  -0.1512978 -0.3010310 -0.07547406 -0.5591764 -0.11907536
## med_high -0.4023185  0.2609362  0.18636222  0.4392319  0.04779161
## high     -0.4872402  1.0149946 -0.07547406  1.0412268 -0.50541663
##                 age        dis        rad        tax     ptratio
## low      -0.8546026  0.8232513 -0.6953230 -0.7314517 -0.45097009
## med_low  -0.3467956  0.2952511 -0.5420083 -0.4653999 -0.03350366
## med_high  0.4318874 -0.4060664 -0.4087527 -0.2821208 -0.32611400
## high      0.7952399 -0.8387924  1.6596029  1.5294129  0.80577843
##                black       lstat        medv
## low       0.37835870 -0.76633833  0.51252310
## med_low   0.32163001 -0.18506538  0.01078532
## med_high  0.08166164  0.03210543  0.15583308
## high     -0.74993410  0.87136020 -0.65964311
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.08288224  0.74545012 -1.02758413
## indus    0.03371880 -0.27414286  0.12800485
## chas    -0.07688231 -0.04569133  0.03313223
## nox      0.24620783 -0.70692830 -1.31233352
## rm      -0.13308714 -0.06060304 -0.15792484
## age      0.29339946 -0.28245910 -0.17691125
## dis     -0.04642187 -0.28428720  0.07074842
## rad      3.40029809  0.91155778 -0.10269503
## tax      0.04070331 -0.03257583  0.61283757
## ptratio  0.11422801  0.09094945 -0.21802056
## black   -0.15410658  0.03229362  0.15457698
## lstat    0.19460392 -0.28843142  0.29476658
## medv     0.17982251 -0.40689877 -0.20349106
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9510 0.0368 0.0122

Prediction

lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class) %>% addmargins()
##           predicted
## correct    low med_low med_high high Sum
##   low       16      10        0    0  26
##   med_low    7      15        4    0  26
##   med_high   1       6       15    1  23
##   high       0       0        1   26  27
##   Sum       24      31       20   27 102

Crosstabulating the correct and predicted classes shows that the model’s predictions are quite accurate for the high category in the test data. In contrast, the low areas are not recognized as well, and both the medium low and medium high classes seem problematic. Relating the results to the graph, the separation is clearest for the highest class, which makes the prediction accuracies understandable.
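The overall and per-class accuracies can be computed directly from the table above (a base-R sketch with the same numbers):

```r
# Test-set confusion matrix from the output above (rows = correct classes).
cm <- matrix(c(16, 10,  0,  0,
                7, 15,  4,  0,
                1,  6, 15,  1,
                0,  0,  1, 26), nrow = 4, byrow = TRUE,
             dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                             predicted = c("low", "med_low", "med_high", "high")))
sum(diag(cm)) / sum(cm)  # overall accuracy: 72/102, about 0.71
diag(cm) / rowSums(cm)   # per-class accuracy: 'high' is clearly the best
```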

K-means

In contrast to LDA, which is a supervised classification method, K-means is an unsupervised method that divides observations into clusters.

Euclidean and Manhattan distances

For K-means clustering the Boston data set is rescaled again so that the distances are comparable. To examine the distance properties of the data and to superficially compare methods, summaries of both the Euclidean (straight-line) and Manhattan (along the axes) distances are printed.

data(Boston)
#center and standardize variables and make it a data frame
boston_scaled<-as.data.frame(scale(Boston))
#Euclidean distance matrix
dist_eu<-dist(boston_scaled)
#for comparison Manhattan distance matrix
dist_man<-dist(boston_scaled,method = "manhattan" )
#summaries 
summary(dist_eu) # Euclidean
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
summary(dist_man)#Manhattan
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
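The difference between the two metrics is easiest to see on a single pair of points (Euclidean = straight line, Manhattan = along the axes):

```r
# Distances between (0, 0) and (3, 4).
pts <- rbind(c(0, 0), c(3, 4))
dist(pts)                        # Euclidean: sqrt(3^2 + 4^2) = 5
dist(pts, method = "manhattan")  # Manhattan: |3| + |4| = 7
```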

Preliminary K-means and determination of the optimal number of clusters

The K-means algorithm is first run exploratorily using five clusters (note that the input here is the Euclidean distance matrix rather than the scaled data itself).

#kmeans using euclidean and five clusters
km <- kmeans(dist_eu, centers = 5)
pairs(boston_scaled, col = km$cluster,lower.panel = NULL)

At first sight the plotted results look a little like fireworks. Before interpreting the plot in more detail, the optimal number of clusters is estimated using the total within-cluster sum of squares (WCSS) with the number of clusters ranging from 1 to 10, and the results are visualized.

k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
#Plot the results
plot(1:k_max, twcss, type = 'b')

The most prominent drop in the sum of squares happens at 2 clusters. There are further drops after two, but they yield only small improvements. Thus, I choose to run the K-means clustering again using 2 as the number of clusters.
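For reference, tot.withinss is the sum, over clusters, of squared distances from each point to its cluster centroid; this can be verified manually (a sketch on simulated data):

```r
# Verify kmeans()$tot.withinss against its definition.
set.seed(1)
x  <- matrix(rnorm(100), ncol = 2)
km <- kmeans(x, centers = 3)
manual <- sum(sapply(1:3, function(k) {
  pts <- x[km$cluster == k, , drop = FALSE]
  sum(sweep(pts, 2, km$centers[k, ])^2)  # squared distances to centroid k
}))
all.equal(manual, km$tot.withinss)       # TRUE
```

Adding clusters can only decrease this quantity, which is why the elbow in the curve, rather than the minimum, is used to pick k.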

K-means using the optimal number of clusters

km <- kmeans(dist_eu, centers = 2)
pairs(boston_scaled, col = km$cluster, lower.panel = NULL)

The new pairwise scatter plot with only two clusters looks better than the previous one. All data points are assigned to one of the two (red/black) clusters. The clearer the colour separation, the more relevant the variable is for the clustering. Property tax and access to radial highways seem to discriminate quite well between the two clusters.

Bonus

K-means is performed on the scaled Boston data using 4 clusters. Thereafter, LDA is fitted using the generated clusters as target classes, and a biplot is printed.

set.seed(123)
data(Boston)
boston_scaled4 <- as.data.frame(scale(Boston,center=TRUE,scale = TRUE))
dist_eu4 <- dist(boston_scaled4)
km2 <-kmeans(dist_eu4, centers = 4)
#pairs(boston_scaled4, col = km$cluster, lower.panel = NULL)
boston_scaled4$classes<-km2$cluster
lda.fit2 <- lda(classes ~ ., data = boston_scaled4)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
plot(lda.fit2, dimen = 2, col = boston_scaled4$classes, pch = boston_scaled4$classes)
lda.arrows(lda.fit2, myscale = 2)

Super-Bonus

The given code is run on the scaled train set. A matrix including the projections of the data points is created.

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

A 3D plot without colours is created:

library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')

A 3D plot is created with the categorized crime rate classes defining the colours:

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=train$crime)

A 3D plot is created with the K-means clusters from the bonus exercise defining the colours:

train$cluster<-km2$cluster[match(rownames(train),rownames(boston_scaled4))]
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=as.factor(train$cluster))

5th WEEK: Dimensionality reduction

Introduction

Data from the United Nations Development Programme are used in this exercise. They encompass e.g. longevity, health and well-being related criteria for the development of a country. The data have been further edited according to this script to finally yield a data set of 155 countries and 8 variables. Gender inequality variables with regard to education and the labour force were added, and the Gender Inequality and Human Development data sets were joined by the variable country. In addition, rows with missing values were removed and only countries are included.

Variable labels and their explanations are described below:

Data are loaded and glimpsed.

## Observations: 155
## Variables: 8
## $ edu2.f_m      <dbl> 1.0072389, 0.9968288, 0.9834369, 0.9886128, 0.96...
## $ lab.f_m       <dbl> 0.8908297, 0.8189415, 0.8251001, 0.8840361, 0.82...
## $ edu.exp       <dbl> 17.5, 20.2, 15.8, 18.7, 17.9, 16.5, 18.6, 16.5, ...
## $ life.exp      <dbl> 81.6, 82.4, 83.0, 80.2, 81.6, 80.9, 80.9, 79.1, ...
## $ GNI           <int> 64992, 42261, 56431, 44025, 45435, 43919, 39568,...
## $ mat.mort      <int> 4, 6, 6, 5, 6, 7, 9, 28, 11, 8, 6, 4, 8, 4, 27, ...
## $ ad.birth.rate <dbl> 7.8, 12.1, 1.9, 5.1, 6.2, 3.8, 8.2, 31.0, 14.5, ...
## $ parl.prop     <dbl> 39.6, 30.5, 28.5, 38.0, 36.9, 36.9, 19.9, 19.4, ...

Summary and graphical overview

Label Variable
edu2.f_m Population with at least some secondary education (female/male ratio)
lab.f_m Labour force participation rate (female/male ratio)
edu.exp Expected years of schooling
life.exp Life expectancy at birth
GNI Gross national income (GNI) per capita
mat.mort Maternal mortality ratio (deaths per 100 000 live births)
ad.birth.rate Adolescent birth rate (births per 1 000 women ages 15-19)
parl.prop Share of seats in parliament (female)

Next, detailed summary statistics and box plots are printed.
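The summary below matches the format of `summary()` on a `tableone` continuous-variable table; a sketch of how it could be produced:

```r
library(tableone)

# summary of all continuous variables, including skewness and kurtosis
cont_table <- CreateContTable(vars = colnames(human), data = human)
summary(cont_table)
```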

## 
##      ### Summary of continuous variables ###
## 
## strata: Overall
##                 n miss p.miss    mean      sd  median    p25     p75   min
## edu2.f_m      155    0      0     0.9     0.2     0.9    0.7     1.0   0.2
## lab.f_m       155    0      0     0.7     0.2     0.8    0.6     0.9   0.2
## edu.exp       155    0      0    13.2     2.8    13.5   11.2    15.2   5.4
## life.exp      155    0      0    71.7     8.3    74.2   66.3    77.2  49.0
## GNI           155    0      0 17627.9 18543.9 12040.0 4197.5 24512.0 581.0
## mat.mort      155    0      0   149.1   211.8    49.0   11.5   190.0   1.0
## ad.birth.rate 155    0      0    47.2    41.1    33.6   12.6    72.0   0.6
## parl.prop     155    0      0    20.9    11.5    19.3   12.4    27.9   0.0
##                  max skew  kurt
## edu2.f_m           1 -0.8  0.65
## lab.f_m            1 -0.9  0.13
## edu.exp           20 -0.2 -0.28
## life.exp          84 -0.8 -0.08
## GNI           123124  2.2  7.23
## mat.mort        1100  2.1  4.43
## ad.birth.rate    205  1.1  1.01
## parl.prop         58  0.6 -0.03

Values from Finland with extreme value countries

Many of the variables have quite a few outliers, meaning that the countries differ tremendously from each other - as expected. For my own interest, and for an overall comparison of the countries with extreme values (the countries holding the maximum and minimum of each variable, in variable order), an additional table is printed, with an extra row for our own national values.
edu2.f_m lab.f_m edu.exp life.exp GNI mat.mort ad.birth.rate parl.prop
Myanmar 1.50 0.91 8.6 66 4608 200 12.1 4.7
Chad 0.17 0.81 7.4 52 2085 980 152.0 14.9
Malawi 0.51 1.04 10.8 63 747 510 144.8 16.7
Syrian Arab Republic 0.73 0.19 12.3 70 2728 49 41.6 12.4
Australia 1.00 0.82 20.2 82 42261 6 12.1 30.5
Niger 0.31 0.45 5.4 61 908 630 204.8 13.3
Japan 1.01 0.69 15.3 84 36927 6 5.4 11.6
Swaziland 0.84 0.61 11.3 49 5542 310 72.0 14.7
Qatar 1.13 0.53 13.8 78 123124 6 9.5 0.0
Central African Republic 0.38 0.85 7.2 51 581 880 98.3 12.5
Sierra Leone 0.46 0.95 8.6 51 1780 1100 100.7 12.4
Belarus 0.94 0.79 15.7 71 16676 1 20.6 30.1
Niger1 0.31 0.45 5.4 61 908 630 204.8 13.3
Slovenia 0.98 0.83 16.8 80 27852 7 0.6 27.7
Rwanda 0.91 1.01 10.3 64 1458 320 33.6 57.5
Qatar1 1.13 0.53 13.8 78 123124 6 9.5 0.0
Tonga 0.99 0.72 14.7 73 5069 120 18.1 0.0
Finland 1.00 0.87 17.1 81 38695 4 9.2 42.5

The variables' ranges are indeed large. Life expectancy is highest, not surprisingly, in Japan at 84 years and lowest in Swaziland at only 49 years. The extreme maternal mortality ratio is reported in Sierra Leone and the extreme adolescent birth rate in Niger. A variable related to wealth, namely gross national income, is highest in Qatar with a value of 123 124, and a huge gap exists to the poorest country, the Central African Republic, reporting a value of 581. Surprisingly, the female/male ratio for population with at least secondary school education is highest in Myanmar.

Variable correlations
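
The overview could have been produced, for instance, with GGally or corrplot (a sketch, not necessarily the chunk used here):

```r
library(GGally)
library(corrplot)

# scatter plot matrix with densities and pairwise correlations
ggpairs(human)

# correlation matrix of the variables as a circle plot
corrplot(cor(human), method = "circle", type = "upper")
```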

 

The only variable that seems normally distributed is expected years of schooling. From the correlation structure it can be seen that there are several relevant correlations, apart from the labour force participation ratio and the female share in parliament. Not unexpectedly, life expectancy is negatively correlated with maternal mortality and adolescent birth rate. On the contrary, it is positively correlated with expected years of education and gross national income. Expected years of education also has negative correlations, e.g. with maternal mortality. To summarize: the more educated women there are, the better the longevity, the higher the overall educational expectancy and national income, and the lower the maternal mortality and adolescent birth rate. Interestingly, two variables (lab.f_m and parl.prop) are only weakly correlated with any of the others.

Principal component analysis

To begin with, it has to be clarified that PCA is an unsupervised approach. This means that the directions of the generated components are identified without using a response variable (Y) to determine them. In other words, PCA focuses on recognizing sets of characteristics without an association to any response variable. PCA extracts important variables, in the form of components, from a large set of variables available in a dataset. The main aim is to recognize relationships between these characteristics, and thus to extract a low-dimensional set of features from a high-dimensional dataset while capturing as much information as possible. With fewer variables, further visualization of the data is more meaningful, but the method can also be used to edit data for subsequent analyses. Components are expressed as combinations of the original variables, they do not correlate with each other, and each is less important than the previous one in terms of explained variance.

PCA on unscaled data

According to the instructions, PCA will be run twice (with unscaled and scaled predictors). Firstly, the former analysis is carried out and a biplot is created. Biplots are basically scatter plots of the observations with two principal components as the coordinate axes. Labeled arrows connect the original variables to the principal components, and their lengths are proportional to the standard deviations of the variables. A small angle between a variable and a PC axis refers to a high positive correlation.
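With base R this is simply `prcomp` and `biplot` (the point and arrow colours are my choice):

```r
# PCA on the unscaled data
pca_human <- prcomp(human)
summary(pca_human)

# biplot of the first two principal components
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
```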

## Importance of components:
##                          PC1      PC2  PC3  PC4  PC5  PC6   PC7   PC8
## Standard deviation     18544 185.5219 25.2 11.5 3.77 1.57 0.191 0.159
## Proportion of Variance     1   0.0001  0.0  0.0 0.00 0.00 0.000 0.000
## Cumulative Proportion      1   1.0000  1.0  1.0 1.00 1.00 1.000 1.000

As shown in the image below, the first principal component is dominated by one variable only: GNI. This is due to the high variance associated with the variable.

When the variables are scaled, we get a much better representation of the variables in 2D space. Thus, the matrix used should be numeric and contain standardized data.

Next, the data are scaled and PCA is run again.

PCA on scaled data
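
A sketch of the scaled run (again assuming the `human` data frame):

```r
# standardize all variables (mean 0, sd 1)
human_std <- scale(human)

# PCA on the standardized data
pca_std <- prcomp(human_std)
summary(pca_std)
biplot(pca_std, choices = 1:2)
```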

It can be seen that scaling dramatically changes the results, because the different scaling factors directly influence the calculation of the principal components. GNI does not dominate the variance anymore, and the results can be interpreted: the first principal component is a linear combination of the original variables and captures the maximum variance in the dataset. It determines the direction of highest variability in the dataset. Further, the first principal component results in a line which is closest to the data, minimizing the sum of squared distances between the data points and the line. The generated PC1 covers 53.6 percent of the variation here. Characteristics of PC1 are high maternal mortality ratio and adolescent birth rate (positive loadings) as well as expected schooling years, life expectancy, female educational proportion and income (negative loadings). This component thus captures longevity and educational aspects. By looking at the graph, it can be further interpreted that when maternal mortality and adolescent birth rate are low, education, longevity, women's schooling and national income are high, and vice versa. The second component is also a linear combination of the original predictors; it aims to capture the remaining variance in the dataset and is uncorrelated with the first component (the correlation between the first and second components is zero). In this example, PC2 covers 16.2 percent of the variation. PC2 encompasses the labour force participation ratio and the female share in parliament, which were already recognized as "different kind of" variables in the first preliminary investigation of the data. It could be scrutinized as a "gender equality" component.

Multiple correspondence analysis

Correspondence analysis (CA) and multiple correspondence analysis (MCA) can be used for dimensionality reduction in the case of categorical variables. MCA is a generalization of PCA and an extension of CA. Basically, cross-tabulations are used as input for presenting the data graphically. The methods can be used for visualization or for pre-editing of the data.

Tea data and My tea data

We practice multiple correspondence analysis using the tea dataset and the MCA() function that come with the package "FactoMineR" by Francois Husson, Julie Josse, Sebastien Le, and Jeremy Mazet. Additionally, I use "factoextra".
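Loading the packages and the data:

```r
library(FactoMineR)
library(factoextra)

# questionnaire on tea drinking habits, shipped with FactoMineR
data(tea)
str(tea)
```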

## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

The FactoMineR package is required with its tea dataset reporting a questionnaire on tea drinking habits. From the 36 collected variables I choose to use altogether six categorical variables: where, work, How, how, age_Q and sex. To see both the categories and the number (percentage) of respondents in each of them, a summary is printed. Based on my interest I stratify it by gender:
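The stratified summary below matches the format of `tableone`; a sketch (the object name `dftea` agrees with the MCA call shown later):

```r
library(tableone)

# keep the six categorical variables of interest
keep <- c("where", "work", "How", "how", "age_Q", "sex")
dftea <- tea[, keep]

# counts and percentages per category, stratified by gender
print(CreateCatTable(vars = keep, strata = "sex", data = dftea))
```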

My tea data summary and graphical overview

##                          Stratified by sex
##                           F           M            p      test
##   n                       178         122                     
##   where (%)                                         0.041     
##      chain store          114 (64.0)   78 ( 63.9)             
##      chain store+tea shop  52 (29.2)   26 ( 21.3)             
##      tea shop              12 ( 6.7)   18 ( 14.8)             
##   work = work (%)          58 (32.6)   29 ( 23.8)   0.128     
##   How (%)                                           0.433     
##      alone                121 (68.0)   74 ( 60.7)             
##      lemon                 19 (10.7)   14 ( 11.5)             
##      milk                  32 (18.0)   31 ( 25.4)             
##      other                  6 ( 3.4)    3 (  2.5)             
##   how (%)                                           0.182     
##      tea bag              100 (56.2)   70 ( 57.4)             
##      tea bag+unpackaged    61 (34.3)   33 ( 27.0)             
##      unpackaged            17 ( 9.6)   19 ( 15.6)             
##   age_Q (%)                                         0.001     
##      15-24                 65 (36.5)   27 ( 22.1)             
##      25-34                 26 (14.6)   43 ( 35.2)             
##      35-44                 25 (14.0)   15 ( 12.3)             
##      45-59                 39 (21.9)   22 ( 18.0)             
##      +60                   23 (12.9)   15 ( 12.3)             
##   sex = M (%)               0 ( 0.0)  122 (100.0)  <0.001

Graphical overview:

There seems to be some variation in each of the variables. The category with the lowest frequency is in the "How" variable (other; only 9 observations), however, not too low to potentially distort the MCA analysis. The age distribution shows that the young are the largest group having answered the questionnaire, and among them females are overrepresented. On the other hand, in the group of 25-34-year-olds there are a lot more males than females. Altogether, there are more female observations than male ones; however, there are respondents in each age group. Surprisingly, as I assume this dataset to originate from Great Britain, most respondents have reported using either tea bags alone, or both tea bags and unpackaged tea. A minority uses only unpackaged tea, which I would have thought to be "the right British manner". Most drink their tea alone, some use milk, a few lemon. There is no honey option at all, and sugar is recorded as a separate variable and is not used in this analysis. Most buy their tea in a chain store, and men visit tea shops more often than women - again a surprising phenomenon to me. And finally, tea is mostly drunk outside work.

MCA

Next, an MCA analysis of the chosen tea data is carried out. To do that, firstly, a cross-tabulated frequency table is standardized to yield relative frequencies across the cells summing up to 1.0. The aim of an MCA is to represent the entries in the table of relative frequencies in terms of the distances between individual rows and/or columns in a low-dimensional space.

The output of the MCA() function is a list including:
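The call is presumably of this form (matching the `Call:` line printed further below):

```r
# MCA on the six chosen variables; graph = FALSE suppresses the default plots
mca <- MCA(dftea, graph = FALSE)
mca          # lists the available result objects
summary(mca) # eigenvalues, individuals and categories
```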

## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 300 individuals, described by 6 variables
## *The results are available in the following objects:
## 
##    name              description                       
## 1  "$eig"            "eigenvalues"                     
## 2  "$var"            "results for the variables"       
## 3  "$var$coord"      "coord. of the categories"        
## 4  "$var$cos2"       "cos2 for the categories"         
## 5  "$var$contrib"    "contributions of the categories" 
## 6  "$var$v.test"     "v-test for the categories"       
## 7  "$ind"            "results for the individuals"     
## 8  "$ind$coord"      "coord. for the individuals"      
## 9  "$ind$cos2"       "cos2 for the individuals"        
## 10 "$ind$contrib"    "contributions of the individuals"
## 11 "$call"           "intermediate results"            
## 12 "$call$marge.col" "weights of columns"              
## 13 "$call$marge.li"  "weights of rows"

and looks like this for the chosen tea-dataset:

## 
## Call:
## MCA(X = dftea, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.297   0.268   0.226   0.202   0.194   0.171
## % of var.             13.701  12.378  10.445   9.345   8.961   7.870
## Cumulative % of var.  13.701  26.078  36.524  45.869  54.830  62.700
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.163   0.157   0.125   0.122   0.100   0.080
## % of var.              7.506   7.230   5.764   5.629   4.603   3.695
## Cumulative % of var.  70.206  77.435  83.199  88.829  93.431  97.127
##                       Dim.13
## Variance               0.062
## % of var.              2.873
## Cumulative % of var. 100.000
## 
## Individuals (the 10 first)
##                         Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## 1                    | -0.161  0.029  0.015 | -0.352  0.154  0.073 |
## 2                    | -0.412  0.191  0.101 | -0.271  0.091  0.044 |
## 3                    | -0.381  0.163  0.098 | -0.168  0.035  0.019 |
## 4                    | -0.457  0.234  0.209 | -0.528  0.347  0.279 |
## 5                    | -0.204  0.047  0.032 | -0.594  0.439  0.277 |
## 6                    | -0.457  0.234  0.209 | -0.528  0.347  0.279 |
## 7                    | -0.161  0.029  0.015 | -0.352  0.154  0.073 |
## 8                    | -0.370  0.154  0.065 | -0.029  0.001  0.000 |
## 9                    |  0.270  0.082  0.025 |  0.780  0.756  0.212 |
## 10                   |  0.289  0.094  0.036 |  0.665  0.549  0.190 |
##                       Dim.3    ctr   cos2  
## 1                    -0.063  0.006  0.002 |
## 2                     0.082  0.010  0.004 |
## 3                    -0.373  0.205  0.094 |
## 4                     0.135  0.027  0.018 |
## 5                     0.076  0.008  0.004 |
## 6                     0.135  0.027  0.018 |
## 7                    -0.063  0.006  0.002 |
## 8                    -0.057  0.005  0.002 |
## 9                    -0.199  0.059  0.014 |
## 10                   -0.156  0.036  0.010 |
## 
## Categories (the 10 first)
##                          Dim.1     ctr    cos2  v.test     Dim.2     ctr
## chain store          |  -0.492   8.688   0.430 -11.337 |  -0.420   7.003
## chain store+tea shop |   0.343   1.720   0.041   3.518 |   1.369  30.275
## tea shop             |   2.254  28.536   0.565  12.994 |  -0.873   4.740
## Not.work             |  -0.011   0.005   0.000  -0.311 |  -0.196   1.702
## work                 |   0.028   0.013   0.000   0.311 |   0.481   4.167
## alone                |  -0.097   0.347   0.018  -2.297 |  -0.231   2.148
## lemon                |   0.890   4.889   0.098   5.409 |   0.548   2.053
## milk                 |  -0.160   0.302   0.007  -1.427 |   0.127   0.210
## other                |  -0.030   0.001   0.000  -0.090 |   2.099   8.215
## tea bag              |  -0.492   7.694   0.316  -9.724 |  -0.382   5.142
##                         cos2  v.test     Dim.3     ctr    cos2  v.test  
## chain store            0.313  -9.675 |  -0.068   0.216   0.008  -1.561 |
## chain store+tea shop   0.658  14.030 |   0.154   0.455   0.008   1.581 |
## tea shop               0.085  -5.034 |   0.032   0.008   0.000   0.187 |
## Not.work               0.094  -5.314 |   0.413   8.907   0.417  11.167 |
## work                   0.094   5.314 |  -1.010  21.807   0.417 -11.167 |
## alone                  0.099  -5.434 |  -0.089   0.380   0.015  -2.099 |
## lemon                  0.037   3.332 |   0.367   1.090   0.017   2.230 |
## milk                   0.004   1.131 |  -0.213   0.702   0.012  -1.900 |
## other                  0.136   6.383 |   2.077   9.527   0.133   6.315 |
## tea bag                0.191  -7.556 |   0.154   0.987   0.031   3.041 |
## 
## Categorical variables (eta2)
##                        Dim.1 Dim.2 Dim.3  
## where                | 0.694 0.676 0.009 |
## work                 | 0.000 0.094 0.417 |
## How                  | 0.099 0.203 0.159 |
## how                  | 0.597 0.461 0.051 |
## age_Q                | 0.299 0.073 0.717 |
## sex                  | 0.093 0.101 0.005 |

The dimdesc() function might help to interpret the dimensions. It shows which variables the axes are most linked to, i.e. which categories describe each axis best.
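For example:

```r
# describe each dimension by its most associated variables and categories
dimdesc(mca, axes = 1:2)
```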

## $`Dim 1`
## $`Dim 1`$quali
##          R2 p.value
## where 0.694 5.1e-77
## how   0.597 2.6e-59
## age_Q 0.299 7.8e-22
## sex   0.093 7.6e-08
## How   0.099 9.4e-07
## 
## $`Dim 1`$category
##                      Estimate p.value
## tea shop                 0.85 9.2e-56
## unpackaged               0.77 9.5e-49
## 25-34                    0.28 1.8e-09
## lemon                    0.40 3.1e-08
## M                        0.17 7.6e-08
## +60                      0.18 4.1e-03
## alone                   -0.14 2.1e-02
## chain store+tea shop    -0.20 3.9e-04
## F                       -0.17 7.6e-08
## 15-24                   -0.48 2.2e-21
## tea bag                 -0.56 2.1e-26
## chain store             -0.65 3.1e-38
## 
## 
## $`Dim 2`
## $`Dim 2`$quali
##          R2 p.value
## where 0.676 1.9e-73
## how   0.461 1.2e-40
## How   0.203 1.6e-14
## sex   0.101 1.7e-08
## work  0.094 5.6e-08
## age_Q 0.073 1.7e-04
## 
## $`Dim 2`$category
##                      Estimate p.value
## chain store+tea shop    0.696 1.8e-71
## tea bag+unpackaged      0.541 4.2e-40
## other                   0.758 4.1e-11
## F                       0.168 1.7e-08
## work                    0.175 5.6e-08
## 35-44                   0.233 3.9e-04
## +60                     0.129 3.6e-02
## 45-59                  -0.156 4.1e-02
## 25-34                  -0.156 2.7e-02
## lemon                  -0.045 7.9e-04
## unpackaged             -0.373 4.3e-07
## tea shop               -0.465 2.9e-07
## Not.work               -0.175 5.6e-08
## alone                  -0.449 2.7e-08
## M                      -0.168 1.7e-08
## tea bag                -0.168 2.0e-15
## chain store            -0.230 4.1e-26

Scree plot

To visualize the percentage of inertia explained by each MCA dimension:
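With factoextra this is a single call (the axis limits and labels are my choice):

```r
# scree plot of the percentage of variance explained per dimension
fviz_screeplot(mca, addlabels = TRUE, ylim = c(0, 15))
```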

The first two dimensions of the MCA explain only about 26% of the variance. Thus, already at this point I think I could perhaps have chosen my variables better, so that they would explain each other's variation more.

Biplots

To further clarify the MCA results, graphical representation is used. Firstly, there is a biplot showing the global pattern within the data. Observations are represented by blue points and variable categories by red triangles and labels. The distance between any observation points or variable points gives a measure of their similarity (or dissimilarity). Similar types of individuals are close on the map, as are similar kinds of variables.
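A sketch with factoextra (the base `plot.MCA` method would work equally well):

```r
# biplot of individuals (points) and variable categories (triangles)
fviz_mca_biplot(mca, repel = TRUE)
```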

Next, a plot is created to visualize the correlation between variables and MCA principal dimensions:

The plot should help to identify the variables that are the most correlated with each dimension. The squared correlations between the variables and the dimensions are used as coordinates.
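In factoextra this corresponds to:

```r
# squared correlations between the variables and the MCA dimensions
fviz_mca_var(mca, choice = "mca.cor", repel = TRUE)
```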

And finally, as is described in the exercise instructions: "The typical graphs show the original classes of the discrete variables on the same 'map', making it possible to reveal connections (correspondences) between different things that would be quite impossible to see from the corresponding cross tables (too many numbers!)."

Or, by using this nice approach, both the observations and the categories can be displayed. Moreover, since some individuals overlap, we can add density curves with geom_density2d() to see the zones that are highly concentrated:
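A hedged sketch of such a plot, built from the individual coordinates (FactoMineR names the coordinate columns "Dim 1", "Dim 2", and so on):

```r
library(ggplot2)

# individual coordinates from the MCA result
ind <- as.data.frame(mca$ind$coord)

ggplot(ind, aes(x = `Dim 1`, y = `Dim 2`)) +
  geom_point(alpha = 0.5) +
  geom_density2d() +   # density curves over concentrated zones
  labs(x = "Dim 1", y = "Dim 2")
```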

On this biplot the first two dimensions are shown. Variable categories with a similar profile are grouped together. Negatively correlated variable categories are positioned on opposite sides of the plot origin (opposed quadrants). We observe that there are a few categories located quite near the center of the graph. The unpackaged tea and tea shop categories, as well as tea bag and chain store, are close to one another. Additionally, not work, alone and the age category from 45 to 59 are located in one group. The youngest and the oldest respondents are the closest to each other among the age groups, i.e. they have similar tea drinking habits. There seems to be one outlier category at the top of the plot: those who drink their tea in "other" ways. The first dimension captures mainly in what form people have their tea and where they buy it from. Individuals with high coordinates on the first component tend to buy their tea unpackaged in tea shops and are likely to drink it with lemon, whereas low-coordinate individuals buy tea bags in chain stores (more common ~ closer to the axis) and use milk. For the second dimension there are "in-between" individuals at the top: they shop for both unpackaged tea and tea bags, in both chain stores and tea shops, and cannot really say how they drink their tea, thus describing it as "other".

References: